Abstract

The change of two orders of magnitude in the new DCF of SRE'10,
relative to the old DCF evaluation criterion, posed a difficult challenge
for participants and evaluator alike. Initially, participants were at a loss
as to how to calibrate their systems, while the evaluator underestimated
the required number of evaluation trials. After the fact, it is now obvious
that both calibration and evaluation require very large sets of trials. This
poses the challenges of (i) how to decide what number of trials is enough,
and (ii) how to process such large data sets with reasonable memory and
CPU requirements.

After SRE'10, at the BOSARIS Workshop, we built solutions to these
problems into the freely available BOSARIS Toolkit. This paper explains
the principles and algorithms behind this toolkit. The main contributions
of the toolkit are:

1. The Normalized Bayes Error-Rate Plot, which analyses likelihood-ratio
calibration over a wide range of DCF operating points. These plots also
help in judging the adequacy of the sizes of calibration and evaluation
databases.

2. Efficient algorithms to compute DCF and minDCF for large score
files, over the range of operating points required by these plots.

3. A new score file format, which facilitates working with very large
trial lists.

4. A faster logistic regression optimizer for fusion and calibration.

5. A principled way to define equal error rate, which is of practical
interest when the absolute error count is small.



1 Introduction 

The BOSARIS Toolkit provides MATLAB code for calibrating, fusing and eval- 
uating scores from (automatic) binary classifiers. It was developed to provide 
solutions for automatic speaker recognition, but we envision that much of the 
code will have wider applicability for other biometric and/or forensics problems, 
where the calibration of likelihood-ratios is of interest. This document serves 
as a user guide, to explain theory and algorithms and is complementary to the 
user manual. 



The theory behind the toolkit is based on the Ph.D. dissertation [1], which
can be consulted for further details. The core implementation (code) was written
by the authors of this document, as part of the ABC: AGNITIO, BUT, CRIM
submission for the 2010 NIST Speaker Recognition Evaluation (SRE'10) [2].
After the evaluation, at the BOSARIS Workshop,¹ we collaborated with a wider
group of researchers to make these algorithms available in toolkit form.²

This document is organized in three sections: Theory is the bulk of the 
document, which explains what the toolkit does and why. Algorithms explains 
how the toolkit does it. Code gives a high-level summary of the implementation. 

2 Theory 

This section provides the theoretical framework which is necessary for a good 
understanding of the BOSARIS Toolkit. For the typical speaker recognition 
expert, part of this material should be very familiar, while other parts may 
be new. All readers should nevertheless review the familiar parts, where the 
terminology for discussing the new material will be established. This section is 
organized as follows: 

• Subsection 2.1 discusses the problem of running out of errors and the way
this is addressed in the toolkit.

• Subsection 2.2 reviews Bayes decision theory, while 2.3 reviews NIST's
DCF criterion for evaluating goodness of decisions.

• Subsection 2.4 introduces the idea that we can evaluate system outputs in
the form of likelihood-ratios, rather than decisions. The key is to let the
evaluator make the decisions, at the theoretically optimal Bayes threshold.
Subsection 2.5 develops this idea into practical evaluation criteria.

• Subsection 2.6 discusses perhaps unfamiliar relationships between the
familiar evaluation tools, ROC/DET, EER and minDCF.

• Subsection 2.7 discusses solutions for fusing and calibrating scores.



2.1 Sampling effects 

All of the evaluation methods used in this toolkit explicitly or implicitly depend 
on estimating various error-rates by counting occurrences of those errors in a 
supervised evaluation database. The error-rates depend not only on the accur- 
acy of the system under evaluation, but also on the operating point. We explain 
operating points in more detail later. What is important here is that no matter 
what the accuracy of the system under evaluation, or no matter what the size 
of the evaluation database, there will be operating points where the error-rates 
become so small that no more errors are observed. More generally, there will 
be operating points where the numbers of observed errors become so small that 
the error-rate estimates become unreliable. 

There are various frequentist (confidence interval) or Bayesian (credible
interval) methods to theoretically quantify the accuracy of such estimates; see
for example [5] and references therein. The results of any such analysis will
depend on various modelling assumptions.

¹ See http://speech.fit.vutbr.cz/workshops/bosaris2010
² Available at: http://sites.google.com/site/bosaristoolkit/

For the speaker recognition problem, one such analysis, Doddington's Rule of
30 [4], is rendered tractable via the assumption of independent Bernoulli trials.³
This rule suggests one needs at least 30 errors to get a probably approximately
correct error-rate estimate. In practice, we have found this rule to work well.
We get sensible results in both training and test, if we ensure that there are at
least 30 misses and at least 30 false-alarms at the operating point of interest.

2.1.1 Toolkit solution 

In the BOSARIS Toolkit, we address the problem by flagging on our plots (DET
curves as well as normalized Bayes error-rate curves) the points at which the
numbers of misses or false-alarms drop below 30. It is up to the user of the
toolkit to understand that regions on the plot beyond these flags must be
treated with caution.

2.1.2 In SRE'10

In SRE'10, at the 'new DCF' operating point of interest, there was a scarcity
of false-alarms, which we addressed by manufacturing many more non-target 
trials. (This was possible because the number of possible non-target trials grows 
quadratically with the number of speakers in the available data.) 
We used the rule of thumb that: 

• If we want to use a database for calibration/fusion, that database has to 
be sufficiently large so that the calibrated/fused system makes at least 30 
training errors of both types, at all operating points of interest. 

• If we want to use an independent database for testing/evaluation, the same 
holds. That database has to be sufficiently large so that the system makes 
at least 30 test errors of both types, at all operating points of interest. 
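
As a rough illustration of the rule of thumb above, the following Python sketch estimates how many trials are needed before 30 errors can be expected at a given error-rate, and how quickly the pool of possible non-target trials grows. The toolkit itself is MATLAB; the function names are hypothetical, and the one-model, one-test-segment-per-speaker setup is an assumption made only to keep the count simple.

```python
import math


def min_trials_for_rule_of_30(error_rate, min_errors=30):
    """Smallest number of trials whose expected error count reaches min_errors."""
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    return math.ceil(min_errors / error_rate)


def max_non_target_trials(num_speakers):
    """Ordered (model, test) pairs of different speakers.

    Assumes one enrollment model and one test segment per speaker,
    which is a simplification made for this sketch.
    """
    return num_speakers * (num_speakers - 1)
```

At a false-alarm rate near 0.001, this suggests on the order of 30,000 non-target trials, which a few hundred speakers can already supply under the stated assumption.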

2.2 Bayes decision theory 

The toolkit is focused on the canonical speaker detection problem, where inde- 
pendent decisions must be made for independent trials, based on the output 
scores of an automatic speaker recognition system. In most of this section, we 
consider the case of making decisions by using the output scores of a single 
system. We defer fusion of multiple systems to subsection 2.7.

The input to the toolkit is in the form of scores calculated by the automatic 
system. We assume that for every trial, the system has calculated a scalar 
detection score and that a decision has to be made based on this score. The 
recipe for doing so is given by Bayes decision theory [5]. 

In the canonical detection problem, there are two alternative hypotheses, 
called target and non-target, exactly one of which must be true for every trial. 
By convention, larger (more positive) scores favour the target hypothesis and 
smaller (more negative) scores favour the non-target hypothesis. 

For every trial, an accept/reject decision is required. We define the outcome
of a trial as the pair (hypothesis, decision), so that there are four possible
outcomes. Two of these are considered to be errors: miss = (target, reject)
and false-alarm = (non-target, accept). The other two outcomes are the correct
outcomes.

³ Are different scores of the same speaker independent? Are miss and false-alarm
rates independent?

The consequence of an outcome is expressed as a cost function, which maps 
outcomes to positive real numbers. Without loss of generality (see [TJ section 
3.4] and [S]), we restrict attention to cost functions which assign zero cost to 
correct outcomes. This leaves two costs to be specified: $C_{\text{miss}}$, the cost of a
miss, and $C_{\text{fa}}$, the cost of a false-alarm.

When given a score, say s, the Bayes decision chooses the option, accept or 
reject, that minimizes the risk. That is, we choose to accept if 

$$P(\text{target} \mid s, \pi)\, C_{\text{miss}} > P(\text{non-target} \mid s, \pi)\, C_{\text{fa}} \qquad (1)$$

and to reject otherwise. The two risks being compared are products of costs
and posterior probabilities. The posteriors are conditioned not only on $s$, but
also on some independent prior information, which we represent as:

$$\pi = P(\text{target}) = 1 - P(\text{non-target}) \qquad (2)$$

We refer to $\pi$ as the target prior, or simply as the prior. By using Bayes' rule
and taking logs, we can rewrite the decision rule as follows:

$$\text{accept, if } \ell(s) > \eta, \text{ or reject otherwise,} \qquad (3)$$

where we have defined the log-likelihood-ratio:

$$\ell(s) = \log \frac{P(s \mid \text{target})}{P(s \mid \text{non-target})} \qquad (4)$$

the Bayes decision threshold:

$$\eta = \log \frac{C_{\text{fa}}}{C_{\text{miss}}} - \operatorname{logit} \pi \qquad (5)$$

and the prior log odds:⁴

$$\operatorname{logit} \pi = \log \frac{\pi}{1 - \pi} \qquad (6)$$

We refer to the function $\ell : \mathbb{R} \to \mathbb{R}$ as the calibration mapping. It maps the
score, $s$, to the log-likelihood-ratio, $\ell(s)$. Since log-likelihood-ratios follow the
same convention as the scores (larger values favour the target hypothesis), they
are also scores. We shall therefore also refer to them as calibrated scores. On the
other hand, scores are generally not calibrated and cannot do the work of
log-likelihood-ratios: when scores are thresholded at the Bayes decision threshold,
they usually do not make good decisions.

The toolkit is concerned with: (i) evaluating the potential ability of the
scores, $s$, to make Bayes decisions, even if the calibration mapping, $\ell$, is not
available; (ii) creating such mappings, by training on a supervised calibration
database; and (iii) evaluating the ability of the calibrated log-likelihood-ratios
$\ell(s)$ to make Bayes decisions.
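
As a minimal illustration of the decision rule (3) with the threshold (5) and the logit (6), consider the following Python sketch. The toolkit itself is written in MATLAB; the function names below are hypothetical and are not part of its API. A score exactly equal to the threshold is rejected here, which is one reading of rule (3).

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def bayes_threshold(prior, c_miss=1.0, c_fa=1.0):
    """Bayes decision threshold: log(C_fa / C_miss) - logit(prior)."""
    return math.log(c_fa / c_miss) - logit(prior)


def bayes_decision(llr, prior, c_miss=1.0, c_fa=1.0):
    """Return True (accept) if the log-likelihood-ratio exceeds the threshold."""
    return llr > bayes_threshold(prior, c_miss, c_fa)
```

With unit costs and a target prior of 0.001, the threshold is close to 6.91, so only log-likelihood-ratios above that value lead to an accept decision.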



⁴ The invertible function $\operatorname{logit}(p) = \log \frac{p}{1-p}$ maps probabilities in [0, 1] to log odds in
[−∞, ∞].



2.3 DCF: criterion for goodness of hard decisions 

The Bayes decision paradigm leads naturally to a recipe for evaluating the good- 
ness of detection decisions made on a database of supervised trials. In the 
Speaker Recognition Evaluations (SREs) of 1997 to the present (2010), NIST 
has required systems under evaluation to submit a hard accept/reject decision, 
as well as a score, for each trial. The primary evaluation criterion, called DCF 
(detection cost function), evaluated the goodness of the hard decisions, while 
secondary criteria (minDCF and DET-curves) evaluated the goodness of the 
scores.⁵

In what follows, we shall always assume that hard decisions, if made by 
the evaluee, are made by thresholding all scores against a single fixed system- 
dependent threshold, set by each evaluee. If the evaluee believes the scores to 
be well-calibrated log-likelihood-ratios, then (s)he may use the Bayes decision 
threshold $\eta$. Otherwise, the threshold may be tuned by the evaluee to minimize
DCF on a supervised calibration database. 

The errors that result from the hard decisions on the supervised evaluation 
database are summarized as the empirical error-rates: Pmiss, the ratio of misses 
to target trials; and Pfa, the ratio of false-alarms to non-target trials. The 
primary evaluation criterion is defined as: 

$$\text{DCF} = \pi\, C_{\text{miss}} P_{\text{miss}} + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}} \qquad (7)$$

It is important to realize that $\pi$ is a synthetic parameter, which models the
target prior in the domain of application. It does not necessarily reflect the
proportion of targets in the evaluation database.

The DCF parametrization, $\pi$, $C_{\text{miss}}$, $C_{\text{fa}}$, can loosely be referred to as the
DCF operating point. The DCF recipe requires the operating point to be fixed
and known to the evaluee. Below we show how to relax this requirement.
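
The DCF recipe of equation (7) can be sketched in a few lines of Python. This is an illustration only, not the toolkit's MATLAB implementation; the function names are hypothetical, and scores equal to the threshold are counted as rejections, which is an assumption about tie handling.

```python
def error_rates(target_scores, non_target_scores, threshold):
    """Empirical (P_miss, P_fa) when accepting scores above the threshold."""
    misses = sum(1 for s in target_scores if s <= threshold)
    false_alarms = sum(1 for s in non_target_scores if s > threshold)
    return misses / len(target_scores), false_alarms / len(non_target_scores)


def dcf(p_miss, p_fa, prior, c_miss, c_fa):
    """Detection cost function: prior-weighted and cost-weighted error-rates."""
    return prior * c_miss * p_miss + (1.0 - prior) * c_fa * p_fa
```

Note that `prior` is the synthetic parameter of the operating point; it is passed in explicitly and is not estimated from the proportion of target scores.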

2.4 Bayes Risk: criterion for goodness of log-likelihood-ratios

A small modification [6, to the DCF evaluation recipe makes it applicable to 
calibrated log-likelihood-ratios, rather than hard decisions: The evaluee
submits log-likelihood-ratios (rather than decisions) and the evaluator makes the
decisions. The requirement for well-calibratedness is enforced by the fact that
the evaluator applies the above-defined Bayes decision threshold, $\eta$.

The error-rates now depend on the evaluator's threshold and we indicate this
by the notation $P_{\text{miss}}(\eta)$ and $P_{\text{fa}}(\eta)$. Since the submitted log-likelihood-ratios
are also scores, it should be clear that if the evaluator were to sweep $\eta$ from $-\infty$
to $\infty$, then $P_{\text{miss}}(\eta)$, $P_{\text{fa}}(\eta)$ would map out the familiar ROC/DET curve.

Let $\mathcal{L} = \ell_1, \ell_2, \ldots, \ell_t, \ldots$ be the log-likelihood-ratios computed by the
system under evaluation for every trial, $t$, in the whole supervised evaluation
database, so that:

$$P_{\text{miss}}(\eta) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} I(\ell_t < \eta), \qquad P_{\text{fa}}(\eta) = \frac{1}{|\mathcal{N}|} \sum_{t \in \mathcal{N}} I(\ell_t > \eta) \qquad (8)$$

where $I$ is the indicator function and $\mathcal{T}$ and $\mathcal{N}$ are the sets of indices belonging
to target and non-target trials.

⁵ DCF as defined here is often referred to as actual DCF, to distinguish it from minDCF,
which will be defined later.

The resulting evaluation criterion, the empirical Bayes risk, is given by: 

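$$\mathcal{R}(\mathcal{L} \mid \pi, C_{\text{miss}}, C_{\text{fa}}) = \pi\, C_{\text{miss}} P_{\text{miss}}(\eta) + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}}(\eta), \quad \text{where } \eta = \log \frac{C_{\text{fa}}}{C_{\text{miss}}} - \operatorname{logit} \pi \qquad (9)$$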

If the evaluator always applies a fixed, known DCF parametrization,
$\pi$, $C_{\text{miss}}$, $C_{\text{fa}}$, then nothing essential has changed. For a 'calibrated'
log-likelihood-ratio, the evaluee could just submit $\ell_t = s_t - \gamma + \eta$, where $s_t$ is
his original uncalibrated score and $s_t > \gamma$ is his original decision rule. In this
case $\mathcal{R}$ would be numerically equal to DCF.

But, if the evaluator sweeps $\eta$ over a range of values, then everything changes.
Now mere shifting will not adequately calibrate the scores. Now scaling as well 
as finer details of the calibration mapping also matter. (After taking care of a 
few more details below, we will demonstrate this experimentally.) 

The empirical Bayes risk as evaluation criterion for log-likelihood-ratios is 
discussed in detail in [TJISIIT]. It can be interpreted as: 

• A proper scoring rule, which encourages both good discrimination (i.e. a 
good DET-curve) as well as good probabilistic calibration (in the sense 
of [H]). See for example [TU], Chapter 13, the section entitled 'The honest 
weatherman', for an insightful explanation. 

• Generalized cross-entropy [11] between the evaluator's perfect empirical
posterior given by the labels and the posterior $P(\text{target} \mid s, \pi)$ of the
evaluee. This information-theoretical analysis provides useful inequalities to
understand the essential properties of this evaluation criterion [1, Chapter
2].

2.4.1 The default system 

Define the default system, which always outputs log-likelihood-ratio of zero, so
that $\mathcal{L}_0 = 0, 0, \ldots$ for every trial. Notice that the posterior of the default system
is the same as the prior: $P(\text{target} \mid \ell_t = 0, \pi) = \pi$. Making Bayes decisions with
the default system is the same as making decisions with the prior alone.

It is easy to show [1, Chapter 2] that if the likelihood-ratios of a system, $\mathcal{L}$,
are sufficiently well calibrated, then $\mathcal{R}(\mathcal{L} \mid \pi, C_{\text{miss}}, C_{\text{fa}}) \le \mathcal{R}(\mathcal{L}_0 \mid \pi, C_{\text{miss}}, C_{\text{fa}})$,
for every operating point $\pi$, $C_{\text{miss}}$, $C_{\text{fa}}$. A system that fails this test at some
operating point can be said to be badly calibrated at that operating point. At
such operating points, on average, better Bayes decisions are obtained by not
using the system.

2.4.2 Simplifying risk to error-rate 

As shown above, a system that outputs well-calibrated likelihood-ratios can be 
expected to make useful (better than default) Bayes decisions at every operating 
point. It therefore seems reasonable to expect of an evaluation procedure to 
test calibration over a wide range of such operating points. The problem is 
that the Bayes risk, as we have defined it, is parametrized by three independent 



parameters, $\pi$, $C_{\text{miss}}$, $C_{\text{fa}}$. How can we design our evaluation recipe to take
account of all operating points in this three-dimensional space?

This problem is solved by realizing that all these operating points can be 
represented by an equivalent one-dimensional range of operating points, which 
is much easier to cover with an evaluation recipe. We show how this is done. 

Define the effective prior as: 

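$$\tilde{\pi} = \frac{\pi\, C_{\text{miss}}}{\pi\, C_{\text{miss}} + (1 - \pi)\, C_{\text{fa}}} \qquad (10)$$

and now parametrize the Bayes risk with $\tilde{\pi}$ and $C_{\text{miss}} = C_{\text{fa}} = 1$. This
reparametrization leaves the Bayes decision threshold, $\eta$, unchanged:

$$\eta = -\operatorname{logit} \tilde{\pi} = \log \frac{C_{\text{fa}}}{C_{\text{miss}}} - \operatorname{logit} \pi \qquad (11)$$

and the evaluation criterion, $\mathcal{R}$, is merely scaled:

$$\mathcal{R}(\mathcal{L} \mid \tilde{\pi}, 1, 1) = \frac{1}{\pi\, C_{\text{miss}} + (1 - \pi)\, C_{\text{fa}}}\, \mathcal{R}(\mathcal{L} \mid \pi, C_{\text{miss}}, C_{\text{fa}}) \qquad (12)$$

where the scaling factor is positive and is not a function of $\mathcal{L}$ or of the
error-rates. This means that if we are comparing the relative benefits of two systems,
say $\mathcal{L}_1$ and $\mathcal{L}_2$, then:

$$\mathcal{R}(\mathcal{L}_1 \mid \tilde{\pi}, 1, 1) < \mathcal{R}(\mathcal{L}_2 \mid \tilde{\pi}, 1, 1) \iff \mathcal{R}(\mathcal{L}_1 \mid \pi, C_{\text{miss}}, C_{\text{fa}}) < \mathcal{R}(\mathcal{L}_2 \mid \pi, C_{\text{miss}}, C_{\text{fa}})$$

from which we conclude that the two criteria are equivalent for evaluation
purposes.⁶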
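
The reparametrization can be checked numerically. The Python sketch below (hypothetical function names, not toolkit code) computes the effective prior of equation (10) and lets one verify that the threshold is unchanged and that the risk is only rescaled. The cost and prior values used for the two NIST operating points are the commonly quoted ones; they are not stated in this section and should be treated as assumptions.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def effective_prior(prior, c_miss, c_fa):
    """Fold the two costs into a single prior, as in equation (10)."""
    weighted_miss = prior * c_miss
    return weighted_miss / (weighted_miss + (1.0 - prior) * c_fa)


def bayes_threshold(prior, c_miss, c_fa):
    return math.log(c_fa / c_miss) - logit(prior)


def bayes_risk(p_miss, p_fa, prior, c_miss, c_fa):
    return prior * c_miss * p_miss + (1.0 - prior) * c_fa * p_fa


# Assumed parameters (prior, C_miss, C_fa) for the old and new NIST points.
old_point = effective_prior(0.01, 10.0, 1.0)   # roughly 0.092
new_point = effective_prior(0.001, 1.0, 1.0)   # approximately 0.001
```

Under these assumed parameters, `-logit(old_point)` agrees with `bayes_threshold(0.01, 10.0, 1.0)` up to rounding, which is the invariance stated in equation (11).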

2.5 Empirical Bayes error-rate: a practical evaluation recipe

We now define our final evaluation criterion for evaluating the goodness of
log-likelihood-ratios. The empirical Bayes error-rate is $\mathcal{E}(\mathcal{L} \mid \tilde{\pi}) = \mathcal{R}(\mathcal{L} \mid \tilde{\pi}, 1, 1)$, so
that:

$$\mathcal{E}(\mathcal{L} \mid \tilde{\pi}) = \tilde{\pi}\, P_{\text{miss}}(-\operatorname{logit} \tilde{\pi}) + (1 - \tilde{\pi})\, P_{\text{fa}}(-\operatorname{logit} \tilde{\pi}) \qquad (13)$$

This criterion is parametrized by the single, scalar parameter, $\tilde{\pi}$, or equivalently
by the Bayes decision threshold, $-\operatorname{logit} \tilde{\pi}$. Again, we refer to this parameter as
the operating point.

The old operating point defined by NIST for the SREs between 1997 and
2008 was at $\tilde{\pi} \approx 0.092$, while the new operating point of 2010 was at $\tilde{\pi} = 0.001$.

In this toolkit, we are interested in evaluation that spans operating points. 
By having confined the operating point to one dimension, this becomes do-able. 
By sweeping over the threshold, this criterion exercises the decision-making abil- 
ity of log-likelihood-ratios in a similar way that the ROC/DET-curve exercises 
the potential decision-making ability of uncalibrated scores. In subsections be- 
low, we shall discuss two ways of sweeping the operating point: one is an integral, 
the other a plot. 



⁶ This equivalence still holds if we allow more general cost functions, which can have negative
costs (i.e. rewards) for correct decisions. In this case, the relationship between the criteria is
affine, rather than linear. [5]



2.5.1 The default system: reference for bad calibration 

We provide two references which can be compared to $\mathcal{E}(\mathcal{L} \mid \tilde{\pi})$ to judge calibration
of $\mathcal{L}$. The first, discussed here, is the upper boundary where calibration fails.
The other (the familiar minDCF), discussed in the next subsection, is an ideal
lower bound, where calibration is optimal.

The default system, $\mathcal{L}_0$, provides the reference error-rate:

$$\mathcal{E}(\mathcal{L}_0 \mid \tilde{\pi}) = \min(\tilde{\pi}, 1 - \tilde{\pi}) \qquad (14)$$

As mentioned above, a system $\mathcal{L}$, for which $\mathcal{E}(\mathcal{L} \mid \tilde{\pi}) > \mathcal{E}(\mathcal{L}_0 \mid \tilde{\pi})$, is said to be
badly calibrated at the operating point $\tilde{\pi}$, because then it would be better not
to use the system.
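
The comparison against the default system can be sketched as follows. This Python fragment is illustrative only (the names are hypothetical, and ties at the threshold are counted as rejections by assumption); it evaluates equation (13) on lists of log-likelihood-ratios and compares the result with equation (14).

```python
import math


def bayes_error_rate(target_llrs, non_target_llrs, eff_prior):
    """Empirical Bayes error-rate at threshold -logit(eff_prior)."""
    threshold = -math.log(eff_prior / (1.0 - eff_prior))
    p_miss = sum(1 for l in target_llrs if l <= threshold) / len(target_llrs)
    p_fa = sum(1 for l in non_target_llrs if l > threshold) / len(non_target_llrs)
    return eff_prior * p_miss + (1.0 - eff_prior) * p_fa


def default_error_rate(eff_prior):
    """Error-rate of the default system, which decides from the prior alone."""
    return min(eff_prior, 1.0 - eff_prior)


def is_badly_calibrated(target_llrs, non_target_llrs, eff_prior):
    """True if the system does worse than not using it at this operating point."""
    actual = bayes_error_rate(target_llrs, non_target_llrs, eff_prior)
    return actual > default_error_rate(eff_prior)
```

For example, adding a large positive offset to otherwise good log-likelihood-ratios makes every trial an accept at a low prior, so the error-rate rises to about one minus the prior and the test above flags the system as badly calibrated.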

2.5.2 minDCF: reference for ideal calibration 

NIST's minDCF is obtained by allowing the evaluator, who has access to the 
true class labels, to choose an optimal threshold at every operating point: 

$$\text{minDCF}(\mathcal{L} \mid \pi, C_{\text{miss}}, C_{\text{fa}}) = \min_{-\infty < \gamma < \infty} \pi\, C_{\text{miss}} P_{\text{miss}}(\gamma) + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}}(\gamma) \qquad (15)$$

Here we are interested in the specialization of minDCF, where the costs are
unity. In analogy with $\mathcal{E}$, we denote it $\mathcal{E}_{\min}$:

$$\mathcal{E}_{\min}(\mathcal{L} \mid \tilde{\pi}) = \text{minDCF}(\mathcal{L} \mid \tilde{\pi}, 1, 1) \qquad (16)$$

Note:

$$\mathcal{E}(\mathcal{L} \mid \tilde{\pi}) \ge \mathcal{E}_{\min}(\mathcal{L} \mid \tilde{\pi}) \le \mathcal{E}(\mathcal{L}_0 \mid \tilde{\pi}) \qquad (17)$$

Like minDCF, $\mathcal{E}_{\min}$ is a secondary evaluation criterion, which fulfils two
functions:

• It provides an ideal reference value for judging calibration. If $\mathcal{E}$ and $\mathcal{E}_{\min}$
are close, then the system can be said to be very well calibrated.

• In the earlier stages of the development of a speaker recognition algorithm,
one is typically not interested in calibration, but just in the potential to
make good decisions at some operating point. $\mathcal{E}_{\min}$ provides a
calibration-insensitive criterion, which can be evaluated over a range of different
operating points.
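
To make the definition concrete, the following Python sketch computes the minimum error-rate by brute force, trying every threshold that produces a distinct set of decisions. It is quadratic in the number of scores and is meant only to illustrate the definition; it is not the efficient algorithm used by the toolkit, and the function name is hypothetical.

```python
def min_bayes_error_rate(target_scores, non_target_scores, eff_prior):
    """Brute-force minimum of the unit-cost Bayes error-rate over thresholds."""
    # Scores above the threshold are accepted. Thresholds below every score
    # and at each distinct score value cover all possible decision patterns.
    distinct = sorted(set(target_scores) | set(non_target_scores))
    candidates = [float("-inf")] + distinct
    best = float("inf")
    for thr in candidates:
        p_miss = sum(1 for s in target_scores if s <= thr) / len(target_scores)
        p_fa = sum(1 for s in non_target_scores if s > thr) / len(non_target_scores)
        best = min(best, eff_prior * p_miss + (1.0 - eff_prior) * p_fa)
    return best
```

Because the candidate set includes the accept-all and reject-all thresholds, the result can never exceed the default error-rate min(prior, 1 - prior), in agreement with the right-hand inequality of (17).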

2.5.3 Cllr: scalar summary of goodness of log-likelihood-ratios 

The BOSARIS Toolkit provides two ways to sweep the operating point: one 
integrates out the operating point to give a scalar, summary criterion; and the 
other plots the error-rate as a function of the operating point. We discuss the 
integral here and the plot in the next subsection. 

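We can define the calibration-sensitive, scalar summary criterion of the
goodness of log-likelihood-ratios, known as $C_{\text{llr}}$, by integrating out the operating
point [7]:

$$C_{\text{llr}} = k \int_{-\infty}^{\infty} \mathcal{E}(\mathcal{L} \mid \operatorname{logit}^{-1} x)\, dx = \frac{1}{2|\mathcal{T}|} \sum_{t \in \mathcal{T}} \log_2\left(1 + e^{-\ell_t}\right) + \frac{1}{2|\mathcal{N}|} \sum_{t \in \mathcal{N}} \log_2\left(1 + e^{\ell_t}\right) \qquad (18)$$

where $k > 0$ is an unimportant scale factor and $\operatorname{logit}^{-1} x = (1 + e^{-x})^{-1}$ is the
inverse⁷ of the logit function.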

This criterion is further discussed in [TJ [71 H] [H] • It can be interpreted as a 
strictly proper scoring rule, empirical cross-entropy, negative log-likelihood and 
as optimization objective for logistic regression. 
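
A direct evaluation of this criterion takes only a few lines. The Python sketch below uses the usual closed form of the criterion in bits (averaged log2(1 + exp(∓llr)) terms for targets and non-targets); the function names are hypothetical, and the numerically stable `softplus` helper is an implementation choice of this sketch.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def cllr(target_llrs, non_target_llrs):
    """Scalar calibration-sensitive cost of log-likelihood-ratios, in bits."""
    tar = sum(softplus(-l) for l in target_llrs) / len(target_llrs)
    non = sum(softplus(l) for l in non_target_llrs) / len(non_target_llrs)
    return (tar + non) / (2.0 * math.log(2.0))
```

In this form, the default system that always outputs zero scores exactly 1 bit, while confident and correct log-likelihood-ratios drive the value towards zero.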

2.5.4 Normalized Bayes-error-rate plots 

To plot $\mathcal{E}(\mathcal{L} \mid \tilde{\pi})$ as a function of the operating point, it is helpful to transform
both the horizontal and vertical axes.

Using $\tilde{\pi} \in [0, 1]$ as the horizontal axis would compress interesting parts of the
graph against the sides of the interval. We therefore use $\operatorname{logit} \tilde{\pi}$ on the horizontal
axis instead. This axis now becomes infinite in both directions and we plot only
a suitable interval, near the origin, $\operatorname{logit} 0.5 = 0$. Plotting an interval that is too
wide is meaningless anyway, because in those regions the prior becomes so close
to 0 or 1 that either the miss or the false-alarm counts drop to zero.

The vertical axis is non-linearly amplified by normalizing with $\mathcal{E}(\mathcal{L}_0 \mid \tilde{\pi}) =
\min(\tilde{\pi}, 1 - \tilde{\pi})$. If this were not done, low error-rates would compress all the
interesting action against the bottom of the plot.

The normalized Bayes-error-rate plot can be described as a plot of $(x, y)$
such that:

$$y = \frac{\mathcal{E}(\mathcal{L} \mid \operatorname{logit}^{-1} x)}{\mathcal{E}(\mathcal{L}_0 \mid \operatorname{logit}^{-1} x)} \qquad (19)$$

Figure 1 gives an example, using synthetic Gaussian scores to compare the true
log-likelihood-ratio against some deliberately miscalibrated versions. This plot
demonstrates:

• The deliberately miscalibrated 'systems' have worse error-rates than the
(green) 'true LR' system, almost everywhere.

• The only region where the green system does worse than the miscalibrated
dashed magenta is due to small sample effects. This is to the left of the red
triangle, where the number of false-alarms becomes very low. The red and
green triangles indicate the points where false-alarms and misses become
scarce (fewer than 30) and therefore indicate the boundaries where
small-sample effects may become a problem for meaningful evaluation. The safe
region is between the two triangles. The error-rates that determine the
horizontal positions of these triangles are obtained from the dashed black
curve, where the evaluator has optimized the thresholds.

• The dashed black curve is $\mathcal{E}_{\min}(\mathcal{L} \mid \tilde{\pi})$. Between the triangles, it coincides
closely with the theoretically optimal 'true LR' green curve. In real cases,
we are not given a true probability model that generated the data, so that
$\mathcal{E}_{\min}$ forms a useful practical reference for judging calibration.

• The solid black line at $y = 1$ represents the default performance of $\mathcal{E}(\mathcal{L}_0 \mid \tilde{\pi})$.
In places, the miscalibrated systems do worse than this reference. The only
one which does not is the underoptimistic 0.5 × logLR.



⁷ $\operatorname{logit}^{-1}$ is also known as the logistic sigmoid.



(The reason why the dashed and solid black lines meet just to the right of +2
for this dataset is that the Gaussian log-likelihood-ratio as a function of the 
score is a parabola, with a minimum just below —2. The system never outputs 
log-likelihood-ratios with smaller values, so that in the far right of the plot, all 
decisions are identical to those made by the default system (i.e. accept).) 
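
The location of this minimum can be checked numerically. The Python sketch below assumes that the two parameters of each Gaussian in the synthetic example are the mean and the standard deviation (targets with mean 3 and standard deviation 2, non-targets standard normal); the function names are hypothetical.

```python
import math


def gaussian_log_pdf(x, mean, std):
    """Log density of a univariate Gaussian with the given mean and std."""
    z = (x - mean) / std
    return -0.5 * math.log(2.0 * math.pi) - math.log(std) - 0.5 * z * z


def true_llr(score):
    """Log-likelihood-ratio for targets N(3, std 2) versus non-targets N(0, std 1)."""
    return gaussian_log_pdf(score, 3.0, 2.0) - gaussian_log_pdf(score, 0.0, 1.0)


# Locate the minimum of the parabola on a fine grid of scores in [-5, 5].
grid = [i / 1000.0 for i in range(-5000, 5001)]
s_min = min(grid, key=true_llr)
llr_min = true_llr(s_min)
```

Under this assumption the parabola has its minimum at a score of -1, where the log-likelihood-ratio equals -1.5 - log 2, a value just below -2, consistent with the remark above.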
Figure 1: Normalized Bayes error-rate plot for a synthetic system with Gaussian
scores: targets ~ $\mathcal{N}(\mu = 3, \sigma = 2)$ and non-targets ~ $\mathcal{N}(0, 1)$. The true
likelihood-ratio is compared against deliberate additive and multiplicative
miscalibrations.

Figures [2] and [3] show further examples of normalized Bayes error-rate plots, 
but now for real speaker recognition scores of systems submitted to SRE'10.

The plots show curves for tests on two databases: dev is the database used 
to train the calibration (SRE2008 eval database in this case) and eval is the 
evaluation database (SRE2010). The Bayes error-rate for the dev database is 
shown in dashed red and that for the eval database in solid red. The minimum 
Bayes error-rate (thick red) is only shown for the eval database. The toolkit can 
also plot the contributions of the misses and false alarms to both the minimum 
Bayes error-rate and actual Bayes error-rate. In the example plots, only the 
contributions to the actual Bayes error-rate are shown (misses in blue, false 
alarms in green). The new (SRE'10) operating point is shown on the plots by
the vertical dashed magenta line at −6.91.
[Figure 2 plot: BUT PLDA i-vector, condition 2. Legend: new DCF point, dev misses, dev false-alarms, dev act DCF, eval misses, eval false-alarms, eval min DCF, eval act DCF, eval DR30.]

Figure 2: Normalized Bayes error-rate plot for an SRE 2010 speaker detector
with good calibration. Here eval denotes the evaluation database and dev the
development database. $P_{\text{tar}} = \tilde{\pi}$, while act DCF and min DCF refer to $\mathcal{E}$ and
$\mathcal{E}_{\min}$. $P_{\text{miss}}$ and normalized $P_{\text{fa}}$ are shown separately. DR30 refers to the point
to the left of which there are fewer than 30 false-alarms. The vertical magenta
dashed line represents the new operating point at $\tilde{\pi} = 0.001$.



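In the region of interest, $x < 0$, which we plot in these figures, the vertical
axis (normalized error-rate) is, with $x = \operatorname{logit} \tilde{\pi}$ and $\eta = -\operatorname{logit} \tilde{\pi} = -x$:

$$y = \frac{\mathcal{E}(\mathcal{L} \mid \tilde{\pi})}{\min(\tilde{\pi}, 1 - \tilde{\pi})} \qquad (20)$$

$$= \frac{\mathcal{E}(\mathcal{L} \mid \tilde{\pi})}{\tilde{\pi}} \qquad (21)$$

$$= P_{\text{miss}}(\eta) + \exp(-\operatorname{logit} \tilde{\pi})\, P_{\text{fa}}(\eta) \qquad (22)$$

$$= P_{\text{miss}}(-x) + \exp(-x)\, P_{\text{fa}}(-x) \qquad (23)$$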
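
The last form of the normalized error-rate is convenient for computing plot values directly from the horizontal coordinate. The Python sketch below (hypothetical names, ties at the threshold counted as rejections by assumption) evaluates it for x = logit of the effective prior with x not above zero, next to a direct evaluation of the defining ratio (20) for comparison.

```python
import math


def error_rates(target_llrs, non_target_llrs, threshold):
    p_miss = sum(1 for l in target_llrs if l <= threshold) / len(target_llrs)
    p_fa = sum(1 for l in non_target_llrs if l > threshold) / len(non_target_llrs)
    return p_miss, p_fa


def normalized_error_rate(target_llrs, non_target_llrs, x):
    """P_miss(-x) + exp(-x) * P_fa(-x), valid where the prior is at most 0.5."""
    if x > 0.0:
        raise ValueError("this form only holds for x <= 0")
    p_miss, p_fa = error_rates(target_llrs, non_target_llrs, -x)
    return p_miss + math.exp(-x) * p_fa


def direct_normalized_error_rate(target_llrs, non_target_llrs, x):
    """Bayes error-rate divided by the default error-rate min(prior, 1 - prior)."""
    prior = 1.0 / (1.0 + math.exp(-x))
    threshold = -math.log(prior / (1.0 - prior))
    p_miss, p_fa = error_rates(target_llrs, non_target_llrs, threshold)
    return (prior * p_miss + (1.0 - prior) * p_fa) / min(prior, 1.0 - prior)
```

The factor exp(-x) grows quickly as x becomes more negative, which is the exponential amplification of false-alarms at low priors.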



The exponential amplification of false-alarms induced by this normalization
explains the shape of the curves for regions of bad calibration. Some form of
amplifying normalization is needed to make the effects of calibration visible in
regions of low error-rate. This normalization is the main difference between
these curves and APE-curves [7]. The normalized Bayes error-rate plot is able
to display a wider range of operating points than the APE-curve.

[Figure 3 plot: BUT i-vector full-cov, condition 2.]

Figure 3: Normalized Bayes error-rate plot for an SRE 2010 speaker detector
with bad calibration. See caption of figure 2 for details.

The points in the plot marked with asterisks (we used triangles in the first
plot) and labelled DR30 refer to Doddington's Rule of 30 [4]. This rule suggests you
need at least 30 false-alarms and at least 30 misses for meaningful evaluation.
The toolkit can plot both the DR30 point for the misses (to the right of which the
absolute number of misses drops below 30) and the one for the false alarms (to
the left of which the absolute number of false-alarms drops below 30). These
points are on the $\mathcal{E}_{\min}$ curve, because we use the false-alarm count and miss
count that result from the evaluator's optimized threshold.

2.6 ROC/DET and related criteria for goodness of scores 

This subsection deals with ROC/DET curves and associated summaries such 
as EER and minDCF, all of which can be applied for calibration-insensitive 
evaluation of the goodness of uncalibrated scores. This is useful for the earlier 
stages of algorithm development, when calibration is not of immediate interest. 
We assume the reader is familiar with the ROC (receiver operating
characteristic) [13]. In this section we concentrate on perhaps unfamiliar relationships
that exist between the ROC, minDCF and EER. In summary: the ROC spans
operating points by plotting error-rates as a function of the threshold; minDCF
samples the ROC at a fixed operating point; EER summarizes the span of
operating points by maximizing over minDCF as a function of the operating point.
The ROC convex hull is central to this analysis and also provides the key to
efficient minDCF and EER calculation.

In our discussion below, we use the term ROC, but (unless otherwise noted) 
everything applies also to DET-curves [TJ. For ROC, we assume the speaker- 
recognition convention where x = Pfa is on the horizontal axis and y = Pmiss 
on the vertical axis.⁸ The DET-curve differs from the ROC by axis warping:⁹
$x = \operatorname{probit}(P_{\text{fa}})$ and $y = \operatorname{probit}(P_{\text{miss}})$.
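
For readers who want to reproduce the DET axis warping outside the toolkit, the probit function is available in the Python standard library as the inverse CDF of the standard normal distribution. The wrapper names below are hypothetical, and error-rates of exactly 0 or 1 are not representable on the warped axes.

```python
from statistics import NormalDist


def probit(p):
    """Inverse standard normal CDF; requires 0 < p < 1."""
    return NormalDist().inv_cdf(p)


def det_point(p_fa, p_miss):
    """Map an ROC point (P_fa, P_miss) to DET-plot coordinates."""
    return probit(p_fa), probit(p_miss)
```

An error-rate of 0.5 maps to the origin of the warped axis, and equal error-rates land on the diagonal of the DET plot.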

There are some aspects of the ubiquitous ROC/DET that seem to be mis- 
understood by many of its users. Here we highlight the following: 

• The ROC is an optimistic view of the decision-making ability of scores, 
because calibration is not tested. If Bayes risk is minimized (i.e. minDCF) 
at a particular operating point 'on the ROC curve', then the calibration 
problem remains of how to choose a threshold that will place the actual 
performance at this operating point. This actual performance is usually 
worse (and cannot be better) than minDCF. 

• The empirical ROC is not a continuous curve. It is a collection of discrete
points in $(P_{\text{fa}}, P_{\text{miss}})$ space, where every point corresponds to a decision
threshold between adjacent scores. If the points are connected with line
segments,¹⁰ then those segments are either vertical or horizontal,
corresponding to target and non-target scores. We shall refer to this plot as the
steppy ROC.

• minDCF operating points do not live exactly on the steppy ROC. They 
live on the ROCCH curve: the lower left boundary of the convex hull 
around the discrete points of the ROC. 

• Although the EER is fixed at $P_{\text{miss}} = P_{\text{fa}}$, it nevertheless forms a summary
of the whole curve: it is a tight upper bound on the unit-cost minDCF
over all operating points. Using EER as optimization objective is a good
idea, because forcing the tight upper bound down, forces the whole curve
down. This can be generalized to any other point on the ROCCH curve
by fixing the ratio $P_{\text{miss}} / P_{\text{fa}}$.



We elaborate on the last two points below. 

2.6.1 The ROCCH is where minDCF lives 

Let there be $n$ points, $[P_{\text{fa}}(i), P_{\text{miss}}(i)]$, in the empirical ROC. A point in $\mathbb{R}^2$ is in
the convex hull of the ROC, if and only if it is a two-dimensional interpolation
between all of the ROC points. That is, a point

$$[x, y] = \sum_{i=1}^{n} \alpha_i\, [P_{\text{fa}}(i), P_{\text{miss}}(i)] \qquad (24)$$

is in the convex hull if and only if all $\alpha_i \ge 0$ and $\sum_{i=1}^{n} \alpha_i = 1$.

⁸ In other fields, the vertical axis is $1 - P_{\text{miss}}$.
⁹ The probit function maps [0, 1] to [−∞, ∞] in a very similar way to the logit function:
$\operatorname{probit}(p) = \sqrt{2}\, \operatorname{erf}^{-1}(2p - 1)$.
¹⁰ Assuming no two scores coincide.

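We already know that minDCF can be expressed either as a continuous
minimization over the threshold ($\gamma$), or as a discrete minimization over the
ROC points. But it can also be expressed [15, 1] as a continuous minimization
over the convex hull, or as a discrete minimization over the set of vertices, $V_{ch}$,
of the convex hull:

$$\begin{aligned}
\text{minDCF}(\pi, C_{\text{miss}}, C_{\text{fa}}) &= \min_{\gamma}\; \pi\, C_{\text{miss}} P_{\text{miss}}(\gamma) + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}}(\gamma) \\
&= \min_{i}\; \pi\, C_{\text{miss}} P_{\text{miss}}(i) + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}}(i) \\
&= \min_{\alpha}\; \sum_{i=1}^{n} \alpha_i \left( \pi\, C_{\text{miss}} P_{\text{miss}}(i) + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}}(i) \right) \\
&= \min_{i \in V_{ch}}\; \pi\, C_{\text{miss}} P_{\text{miss}}(i) + (1 - \pi)\, C_{\text{fa}} P_{\text{fa}}(i)
\end{aligned} \qquad (25)$$

where $\alpha = [\alpha_1, \ldots, \alpha_n]$ is subject to the above-mentioned convexity constraint.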
This means that although parts of the convex hull seem more optimistic than 
the steppy ROC, these parts do not give lower minDCF, no matter what the 
operating point. 

The DCF minima live on the lower left boundary of the convex hull, which 
forms a continuous, piecewise linear, convex curve between the points (0, 1) and 
(1, 0). We shall refer to this curve as the ROCCH curve. 

The BOSARIS Toolkit provides the functionality to compute the ROCCH
curve, as well as the associated DET-curve obtained by applying the non-linear
(probit) mapping to the axes. Figure 4 shows two examples. For further
examples, see [1, Chapter 7], or [16], or try to plot some of your own, using the
toolkit.

The ROCCH vertex set, Vch, is typically much, much smaller than the em- 
pirical ROC. Since the convex hull can be computed efficiently (see the PAV 
algorithm below), and since it is valid for all operating points, this is the key 
to efficient minDCF computations for large score sets, over a large range of 
operating points. 
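
The last equality in (25) can be checked numerically. The following is a minimal Python sketch, not code from the (MATLAB) toolkit: all function names are hypothetical, it assumes no two scores coincide, and it finds the ROCCH vertices with a generic monotone-chain convex hull instead of the PAV algorithm that the toolkit uses.

```python
def roc_points(tar, non):
    """Steppy ROC: one (Pfa, Pmiss) point per threshold between adjacent scores."""
    T, N = len(tar), len(non)
    labelled = sorted([(s, True) for s in tar] + [(s, False) for s in non])
    misses = rejected = 0
    points = [(1.0, 0.0)]              # threshold below every score: accept all
    for _, is_target in labelled:      # move the threshold up past one score
        if is_target:
            misses += 1                # vertical step (Pmiss changes)
        else:
            rejected += 1              # horizontal step (Pfa changes)
        points.append(((N - rejected) / N, misses / T))
    return points


def _cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def rocch_vertices(points):
    """Vertices of the lower-left convex hull boundary, from (0, 1) to (1, 0)."""
    hull = []
    for p in sorted(points, key=lambda q: (q[0], -q[1])):
        # Drop the last vertex while it does not make a strict left turn.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull


def min_dcf(points, prior, c_miss=1.0, c_fa=1.0):
    """Discrete minimization of the DCF over a set of (Pfa, Pmiss) points."""
    return min(prior * c_miss * pmiss + (1 - prior) * c_fa * pfa
               for pfa, pmiss in points)


# Made-up scores, for illustration only.
tar = [2.1, 0.3, 1.5, -0.4, 3.0, 0.9]
non = [-1.2, 0.1, -0.7, 1.1, -2.0, 0.5, -0.1]
roc = roc_points(tar, non)
hull = rocch_vertices(roc)

for prior in (0.001, 0.1, 0.5, 0.9):
    # Same minimum, but computed from far fewer points.
    print(prior, min_dcf(roc, prior), min_dcf(hull, prior))
```

For this toy example the hull has only a handful of vertices, while the steppy ROC has one point per score plus one.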

2.6.2 EER as upper bound 

The EER (equal-error-rate) is usually defined as the point on the ROC where
Pmiss = Pfa. For the empirical ROC, in general, no point exactly satisfies this
equality, but it can be satisfied by interpolation. If we choose to interpolate
between all points in the ROC, we again find ourselves on the ROCCH curve.
We denote the point on the ROCCH curve where Pmiss = Pfa as the ROCCH-EER.
We propose to use the ROCCH-EER as a well-defined, practical version
of the EER and this functionality is provided as such by the toolkit.
The ROCCH-EER has the following interesting property [1]:



$$\text{ROCCH-EER} = \max_{\pi} \min_{-\infty < \gamma < \infty} \pi P_{miss}(\gamma) + (1-\pi) P_{fa}(\gamma) = \max_{\pi} \mathrm{minDCF}(\pi, 1, 1) \qquad (26)$$

Figure 4 demonstrates this.

Footnote: The convexity does not hold when these curves are translated to DET space.



Figure 4: Two examples of ROCCH-DET vs classical steppy DET. The equality
of ROCCH-EER and max minDCF is demonstrated. (Here π = Ptar.)



ROCCH-EER is obtained by maximizing w.r.t. the operating point, while 
minimizing w.r.t. the threshold. The minimization confines us to the ROCCH 
curve, while the maximization finds the most pessimistic operating point on this 
curve. The ROCCH-EER therefore forms a tight upper bound on the Bayes 
error-rate that can be obtained with perfect calibration. By pushing down on 
the EER, we are pushing down the whole curve. 
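
As a small numerical illustration of (26), the Python sketch below evaluates max over π of minDCF(π, 1, 1) for a made-up set of ROCCH vertices. The vertex list and the function names are hypothetical. For these vertices, linear interpolation between (0.1, 0.2) and (0.4, 0.05) puts the point where Pmiss = Pfa at 1/6, and the sketch recovers the same value.

```python
def min_dcf(vertices, prior):
    """minDCF(prior, 1, 1), minimized over ROCCH vertices given as (Pfa, Pmiss)."""
    return min(prior * pmiss + (1 - prior) * pfa for pfa, pmiss in vertices)


def max_min_dcf(vertices):
    """Exact maximum over the prior of minDCF(prior, 1, 1).

    minDCF is concave and piecewise linear in the prior, so its maximum is
    attained at a prior of 0 or 1, or where two adjacent vertices tie.
    """
    candidates = [0.0, 1.0]
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        denom = (y0 - x0) - (y1 - x1)
        if denom > 0:
            # Prior at which both end-points of this segment cost the same.
            candidates.append((x1 - x0) / denom)
    return max(min_dcf(vertices, p) for p in candidates)


# Hypothetical ROCCH vertices, ordered from (0, 1) to (1, 0).
vertices = [(0.0, 1.0), (0.0, 0.5), (0.1, 0.2), (0.4, 0.05), (1.0, 0.0)]
bound = max_min_dcf(vertices)
print(bound)  # approximately 1/6, the ROCCH-EER of this example
```

Evaluating `min_dcf` on any grid of priors gives values that never exceed `bound`, which is the upper-bound property described above.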

Another way to see this is the fact that minDCF(π, 1, 1) is a concave function
of π (see Figure 4). If we push down at the maximum of this curve (by trying to build
a system that gets better EER) it cannot form a dent in the curve that violates 
concavity. If anything moves, the whole curve has to go down in such a way as 
to respect concavity. 

This does not guarantee that if we reduce ROCCH-EER, we will have reduced
minDCF at all operating points. Even if the value of the maximum is
reduced, its position, π, can move in such a way that error-rates can increase
somewhere far from the maximum. This lateral movement is roughly analogous 
to tilting of the DET-curve. If, however, we want to target a specific region of 
operating points of interest, we can generalize this idea. This is shown in the 
next subsection. 






2.6.3 UER: Unequal-error-rate 

We can generalize ROCCH-EER by considering a point, [Pfa, Pmiss], on the
ROCCH curve where Pfa = r Pmiss. For any r ≠ 1, this is an unequal-error-rate.
Such points also have the interpretation that they form tight upper bounds
on minDCF. To see this, choose any costs such that Cmiss = r Cfa. We can
show [1] that there exists a point, [Pfa(r), Pmiss(r)], on the ROCCH curve, such
that:

$$C_{miss} P_{miss}(r) = C_{fa} P_{fa}(r) = \max_{\pi} \mathrm{minDCF}(\pi, C_{miss}, C_{fa}) \qquad (27)$$

The point on the curve depends just on the ratio r. By varying r between zero
and infinity, we can map out the whole ROCCH curve. If we arbitrarily set
Cfa = 1 and Cmiss = r, we can define the unequal-error-rate as:

$$\mathrm{UER}(r) = P_{fa}(r) = r P_{miss}(r) = \max_{\pi} \mathrm{minDCF}(\pi, r, 1) \qquad (28)$$

Again, this value forms a tight upper bound of a concave function of π, so that
using UER as optimization objective pushes down the whole curve. If we choose
r ≈ π, then we will be targeting operating points in the vicinity of π.

In summary, the whole ROC/DET curve has this 'stiffness' property induced 
by the concavity, so that trying to optimize some point on the curve will tend 
to also improve the decision-making ability of the curve over a larger region. 

2.6.4 PRBEP 

Finally, we mention another variant on this idea, where we re-weight the
error-rates to represent absolute error counts. By choosing Cmiss = T, the number of
target trials; and Cfa = N, the number of non-target trials, the toolkit provides
the functionality to compute the precision-recall-break-even-point:

$$\mathrm{PRBEP} = N \times \mathrm{UER}(T/N) = T P_{miss} = N P_{fa} = \max_{\pi} \mathrm{minDCF}(\pi, T, N) \qquad (29)$$

which represents the point on the ROCCH curve where the absolute number of
misses and false alarms are equal.

If the error-rates of the recognizer are low relative to the number of available 
evaluation trials, then this forms a sensible evaluation objective, which balances 
the two error-counts, keeping them both from becoming too small for as long 
as possible. 

Here we prefer to present the result as an absolute number of errors, rather 
than as an error-rate, so that if the number of errors becomes small, the user is 
effectively warned that this is happening. 

PRBEP cannot be used for meaningful comparisons across databases of dif- 
ferent sizes. It is meant for comparison of different systems on the same data- 
base. 
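
The unequal-error-rate and PRBEP can be read off the ROCCH curve by intersecting it with the line Pfa = r Pmiss. The Python sketch below illustrates this under stated assumptions: the vertex list and the trial counts are made up, and `uer_point` is a hypothetical helper, not a toolkit function.

```python
def uer_point(vertices, r):
    """Point (Pfa, Pmiss) on the piecewise-linear ROCCH curve with Pfa = r * Pmiss.

    `vertices` are (Pfa, Pmiss) pairs ordered from (0, 1) to (1, 0); r > 0.
    """
    for (x0, y0), (x1, y1) in zip(vertices, vertices[1:]):
        g0, g1 = x0 - r * y0, x1 - r * y1    # Pfa - r * Pmiss at both ends
        if g0 <= 0 <= g1 and g1 > g0:        # the line crosses this segment
            t = -g0 / (g1 - g0)
            return x0 + t * (x1 - x0), y0 + t * (y1 - y0)
    raise ValueError("vertices must run from (0, 1) to (1, 0)")


# Hypothetical ROCCH vertices and trial counts.
vertices = [(0.0, 1.0), (0.0, 0.5), (0.1, 0.2), (0.4, 0.05), (1.0, 0.0)]
T, N = 50, 200                               # target and non-target trials

eer, _ = uer_point(vertices, 1.0)            # r = 1 gives the ROCCH-EER
pfa, pmiss = uer_point(vertices, T / N)      # Cmiss = T, Cfa = N, so r = T / N
prbep = N * pfa                              # expected number of false alarms
print(eer, prbep, T * pmiss)
```

Here `prbep` and `T * pmiss` agree (about 14.3 errors of each kind), and the value is not a whole number because the ROCCH curve is an interpolation.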



Footnote: Interestingly, if we exchange max and min, the error-rates that satisfy
min_γ max_π DCF(γ | π, r, 1) map out the steppy ROC as we vary r.

Footnote: Since the ROCCH curve is an interpolation, this will in general not be a whole number.
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2.7 Fusion and Calibration 

The toolkit provides two solutions for calibration, which is the task of finding
a mapping that maps scores to log-likelihood-ratios. In both cases, the
mapping is 'trained' on a supervised calibration database. One solution is
non-parametric, based on isotonic regression. The other is parametric, based on
logistic regression. The logistic regression solution generalizes also to a fusion
recipe.

The non-parametric calibration finds a solution that is (on the training data)
simultaneously optimal for any sensible objective function for measuring the
goodness of calibration [1, Appendix C]. In practice however, we have found
that the parametric solution usually performs better on independent test data.

2.7.1 PAV: Non-parametric calibration

The convention that the larger the score, the more it favours the target
hypothesis, suggests that the calibration mapping should be monotonically rising
(isotonic) [17]. Since we have a finite number of training scores, each of which
must be mapped to a log-likelihood-ratio, this can be done in a non-parametric
way. We can independently choose the value for each point, subject only to the
monotonicity constraint. This problem is known as isotonic regression and an
efficient implementation is given by the PAV (pool adjacent violators) algorithm,
which we discuss in the next section.

Attractive features of this solution are: 

• On training data, as mentioned above, it is optimal, no matter how you 
measure optimality. 

• It corresponds exactly to ℰ_min (minDCF): If a data set is optimized with
PAV, and then evaluated on the same data set with ℰ (DCF), then DCF
= minDCF.

• It also corresponds exactly to using the slope of the ROCCH curve as
calibrated likelihood-ratio [18].

• The type of the score distribution is unimportant. In fact, the procedure 
is invariant to any monotonic warping of the scores. In contrast, the 
parametric logistic regression calibration solution below works best for 
approximately normal score distributions. 

2.7.2 Logistic regression: parametric fusion and calibration 

The toolkit provides a logistic regression solution, which can: 

• train a calibration mapping for a single system;

• train combination weights to fuse multiple subsystems into a single
subsystem which outputs well-calibrated log-likelihood-ratios; and

• also incorporate certain kinds of side-information, or quality measures. 

Footnote: It is shown in [15] that isotonic regression can also be used for fusion, but this is not yet
implemented in the toolkit.

Footnote: That is, any strict or non-strict proper scoring rule, or Bayes risk criterion.

All of this functionality is provided by optimization of the parameters of the
following mapping:

$$\ell_t = a + \sum_{i=1}^{N} b_i s_{it} + \mathbf{q}_t' W \mathbf{r}_t \qquad (30)$$

where ℓ_t is the fused and calibrated output log-likelihood-ratio for trial t; N is
the number of subsystems to be fused (if N = 1, then the result is just
calibration); s_it is the score of subsystem i for trial t; q_t and r_t are optional 'quality
vectors', derived from the two sides (enrol, verify) of trial t. The parameters
to be optimized are the scalar offset a, the scalar combination weights b_i and a
symmetric matrix W, which effectively combines the two quality vectors into a
quality score for the trial.
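
To make the roles of the parameters in (30) concrete, here is a minimal Python sketch that evaluates the mapping for a single trial. The function name `fuse` and all numerical values are hypothetical, and the sketch says nothing about how the toolkit stores or trains these parameters.

```python
def fuse(a, b, scores, W=None, q=None, r=None):
    """Log-likelihood-ratio for one trial: a + sum_i b_i * s_i + q' W r.

    a      : scalar offset
    b      : combination weights, one per subsystem
    scores : subsystem scores for this trial, same length as b
    W, q, r: optional quality matrix and the two quality vectors
    """
    llr = a + sum(b_i * s_i for b_i, s_i in zip(b, scores))
    if W is not None:
        llr += sum(q[j] * W[j][k] * r[k]
                   for j in range(len(q)) for k in range(len(r)))
    return llr


# One subsystem: the mapping reduces to an affine calibration.
calibrated = fuse(a=-0.5, b=[2.0], scores=[1.5])

# Two subsystems plus a quality term built from two-dimensional quality vectors.
fused = fuse(a=-0.5, b=[1.2, 0.8], scores=[2.0, -1.0],
             W=[[0.1, 0.0], [0.0, 0.2]], q=[1.0, 2.0], r=[3.0, 4.0])
print(calibrated, fused)
```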

The parameters are optimized with logistic regression, which minimizes an
objective function, which is very similar to the above-defined Cllr. This
objective function is the evaluation criterion for a supervised calibration database,
which must be provided by the user. Since the objective function is calibration
sensitive, optimizing it causes the fused output to be well calibrated. See [19],
or [1, Chapter 8] for more details.
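
The idea can be illustrated with a small Python sketch that trains an affine calibration, ℓ = a + b s, by minimizing Cllr itself. This rests on assumptions: the toolkit's objective is only described here as very similar to Cllr, the sketch uses plain gradient descent rather than the toolkit's optimizer, and the scores are made up.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def cllr(tar_llrs, non_llrs):
    """Cllr in bits: equals 1 when every log-likelihood-ratio is 0."""
    tar_term = sum(softplus(-l) for l in tar_llrs) / len(tar_llrs)
    non_term = sum(softplus(l) for l in non_llrs) / len(non_llrs)
    return (tar_term + non_term) / (2 * math.log(2))


def train_affine(tar, non, steps=2000, rate=0.05):
    """Gradient descent on Cllr over the offset a and scale b."""
    a, b = 0.0, 1.0
    k = 2 * math.log(2)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for s in tar:
            g = -sigmoid(-(a + b * s)) / (len(tar) * k)
            grad_a += g
            grad_b += g * s
        for s in non:
            g = sigmoid(a + b * s) / (len(non) * k)
            grad_a += g
            grad_b += g * s
        a -= rate * grad_a
        b -= rate * grad_b
    return a, b


# Made-up, badly calibrated scores (shifted towards the target hypothesis).
tar = [4.0, 2.5, 3.0, 1.0, 3.5]
non = [0.5, -1.0, 1.5, 0.0, 2.0]

a, b = train_affine(tar, non)
before = cllr(tar, non)
after = cllr([a + b * s for s in tar], [a + b * s for s in non])
print(before, after)
```

Because the objective is calibration sensitive, the trained offset and scale lower Cllr on the training scores relative to using the raw scores as log-likelihood-ratios.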

3 Algorithms 

This section describes the key algorithms that help the toolkit to efficiently 
process very large sets of scores. 

3.1 Efficient DCF and minDCF 

This subsection describes efficient algorithms for computing DCF and minDCF. 
With more traditional implementations, computation of ℰ (DCF) and ℰ_min
(minDCF), over the range required by a normalized Bayes error-rate plot, may
take several minutes for large trial lists (a few million scores). By comparison, 
the implementation in the BOSARIS Toolkit takes a few seconds to execute. 

3.1.1 DCF 

To efficiently compute ℰ, pool all the scores, ℒ = ℓ_1, ℓ_2, ..., with all the
different thresholds, −logit π_i, at which ℰ(ℒ | π_i) is to be evaluated. Sort them all
together, in increasing order, keeping track of where the thresholds end up. The
miss and false-alarm rates at threshold i are given by

$$P_{miss}(i) = \{ t_i - (D - i + 1) \} / T \qquad (31)$$

$$P_{fa}(i) = \{ N - (n_i - (D - i + 1)) \} / N \qquad (32)$$

where t_i is the position of the ith threshold in the sorted list (after deleting
non-target scores), n_i is the position of the ith threshold in the sorted list (after
deleting target scores), D is the number of thresholds, T is the number of target
scores and N is the number of non-target scores. Equation (13) then gives ℰ.
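
The pooled-sort recipe can be sketched in Python as follows. This is an illustration, not the toolkit's MATLAB code, and it makes two assumptions that the text does not state: the priors π_i are strictly increasing in i (so the thresholds decrease with i, which is what makes D − i + 1 the number of thresholds at or below threshold i), and a score equal to a threshold counts as accepted.

```python
import math


def error_rates(tar, non, priors):
    """Pmiss and Pfa at thresholds -logit(prior), using a single pooled sort.

    `priors` must be strictly increasing (assumption of this sketch).
    """
    D, T, N = len(priors), len(tar), len(non)
    thresholds = [-math.log(p / (1 - p)) for p in priors]
    # kind 0 = threshold, 1 = target score, 2 = non-target score. On ties the
    # threshold sorts first, so an equal score lies above it (accepted).
    pooled = sorted([(th, 0, i) for i, th in enumerate(thresholds, start=1)]
                    + [(s, 1, 0) for s in tar]
                    + [(s, 2, 0) for s in non])
    seen_thr = seen_tar = seen_non = 0
    t, n = {}, {}
    for _, kind, i in pooled:
        if kind == 0:
            seen_thr += 1
            t[i] = seen_tar + seen_thr   # position with non-targets deleted
            n[i] = seen_non + seen_thr   # position with targets deleted
        elif kind == 1:
            seen_tar += 1
        else:
            seen_non += 1
    pmiss = [(t[i] - (D - i + 1)) / T for i in range(1, D + 1)]
    pfa = [(N - (n[i] - (D - i + 1))) / N for i in range(1, D + 1)]
    return pmiss, pfa


# Made-up scores and operating points.
tar = [2.1, 0.3, 1.5, -0.4, 3.0, 0.9]
non = [-1.2, 0.1, -0.7, 1.1, -2.0, 0.5, -0.1]
priors = [0.01, 0.1, 0.5, 0.9]
pmiss, pfa = error_rates(tar, non, priors)
print(pmiss, pfa)
```

After the one sort, every threshold is handled in a single pass, instead of rescanning all scores for each operating point.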
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3.1.2 minDCF 

To efficiently compute ℰ_min, compute the vertices of the ROCCH curve, using
the PAV algorithm (see section 3.2). There are typically very few of these
vertices and, as shown in section 2.6.1, the original large ROC can be replaced
with these vertices, without changing the value of minDCF. Then use the last
line of (25).



3.2 The PAV Algorithm 

The PAV (pool adjacent violators) algorithm is central to the efficient
implementation of many of the toolkit functions. We use it to efficiently compute the
vertices of the ROCCH curve [18]. Once we have these vertices, we can
compute minDCF, EER, UER, PRBEP and the non-parametric calibration mapping
(see the relevant subsections in the theory section).

The PAV algorithm solves the problem of assigning a likelihood-ratio to 
each score in some supervised database of target and non-target scores. The 
likelihood-ratios are adjusted non-parametrically and independently, subject 
only to the monotonicity constraint that if the scores are sorted, then the 
likelihood-ratios must also be sorted. The PAV solution turns out to be
simultaneously optimal for any proper scoring rule and therefore for any Bayes risk
criterion, with any cost function and any prior [1, Appendix C].

The PAV algorithm complexity is linear in the number of scores and the
preceding sort has complexity of order T log(T). In our implementation, sorting
and applying PAV takes a few seconds for a few million scores.
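
A minimal Python sketch of a stack-based PAV pass is given below, as an illustration of why the cost is linear after sorting: each label is pushed once and pooled at most once. The function names are hypothetical and the scores are made up. The conversion from the fitted posteriors to log-likelihood-ratios, by subtracting the log-odds of the training proportions, is a choice of this sketch, and it lets scores in pure blocks map to infinite values.

```python
import math


def pav(labels):
    """Non-decreasing least-squares fit to 0/1 labels already sorted by score."""
    blocks = []                          # each block: [sum of labels, count]
    for y in labels:
        blocks.append([y, 1])
        # Pool adjacent blocks while the earlier mean exceeds the later mean
        # (means compared by cross-multiplication to avoid division).
        while len(blocks) >= 2 and (blocks[-2][0] * blocks[-1][1]
                                    > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def pav_llrs(tar, non):
    """Sorted scores and their PAV-calibrated log-likelihood-ratios."""
    scored = sorted([(s, 1) for s in tar] + [(s, 0) for s in non])
    posteriors = pav([label for _, label in scored])
    prior_log_odds = math.log(len(tar) / len(non))
    llrs = []
    for p in posteriors:
        if p == 0.0:
            llrs.append(-math.inf)
        elif p == 1.0:
            llrs.append(math.inf)
        else:
            llrs.append(math.log(p / (1 - p)) - prior_log_odds)
    return [s for s, _ in scored], llrs


tar = [2.1, 0.3, 1.5, -0.4, 3.0, 0.9]
non = [-1.2, 0.1, -0.7, 1.1, -2.0, 0.5, -0.1]
sorted_scores, llrs = pav_llrs(tar, non)
print(list(zip(sorted_scores, llrs)))
```

The output respects the monotonicity constraint: the log-likelihood-ratios are sorted whenever the scores are.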

3.3 New logistic regression optimizer 

The BOSARIS Toolkit uses a general-purpose, unconstrained convex
optimization algorithm to train the logistic regression fusion and calibration solutions.
It uses a quasi-Newton method, which is faster, generally better behaved and
converges to a better solution than the conjugate gradient optimizer which was
used in its predecessor, the FoCal Toolkit.

The new optimizer uses the trust region Newton conjugate gradient algorithm
for large-scale unconstrained minimization [20, 21].

4 Code 

This section gives a high-level overview of some of the salient features of the 
implementation of the algorithms. More detail is available in the user manual 
which is distributed with the toolkit. 

The current implementation is written in MATLAB, with an object-oriented
API (application programmer's interface). The objects are not an essential part
of the code; they are just a way to organize the API. If this type of interface
turns out to be a hindrance rather than a help to users, it would be possible to
replace this API.



Footnote: The vertices of the whole convex hull are the same as the vertices (cusps) of the piecewise
linear ROCCH curve.

Footnote: Available at: http://sites.google.com/site/nikobrummer/focal

Footnote: MATLAB object oriented code does not scale well to large problems.

The main feature of the code that remains to be highlighted in this last
section is the efficient, binary, platform-independent score file format. The
efficiency of the format relies on the assumption that trial lists can be represented
as dense matrices, where the row and column indices are the two sides (enrol,
verify) of a trial. We assume that each enrolment or each verification side is
to be matched against many, or even all, others. (Such dense score matrices
were necessary for ensuring an adequate number of non-target trials and
therefore an adequate number of false-alarms at the new operating point, π = 0.001,
of SRE'10.)

We use a platform-independent HDF5 binary score format to encourage
interoperability with other tools. Text files would also give interoperability, but
are much larger and much slower to process.

4.1 Data 

The code in the toolkit is primarily concerned with storing and manipulating 
the following data types: 

indexes list model and test segment names and indicate which pairs of model 
and test segment are in the trial list described by the index. 

keys are similar to indexes, but also give the answers i.e. which trials are target 
trials and which are non-target trials. 

scores store scores for a list of trials (specified by an index or a key). In 
addition to the actual scores, a score object contains all the information 
that an index describes. 

quality measures can be seen as scores for a model or test segment (instead
of for a trial). These can be fused with ordinary scores (see section 2.7).

Indexes can be used: 

• for aligning scores from different systems before fusing them

• for selecting parts of score objects of interest (e.g. those for male trials)

• by external code that produces scores. This code can load an index file
which indicates which segment pairs to produce scores for.



Two score objects can be merged to make a new score object provided that 
they don't provide scores for the same trial. Parts of score objects can be 
selected (to produce a new score object) either by using an index or by using 
lists of models or segments to discard. 

4.2 Plots 

The toolkit can produce two types of plots: 



DET plots (see section 2.6) either from points on the ROC or from the
ROCCH curve.

Normalized Bayes error-rate plots (see section 2.5.4). Both minimum and
actual Bayes error-rate curves can be plotted, as well as curves showing
the contributions of the misses and false alarms, respectively, to those
curves. A vertical line indicating the operating point can be placed on the
plot.



DR30 points (see section 2.1) for misses and false alarms can be placed on
both types of plots.

4.3 Calibration 

The high level wrapper functions for calibration have two variants: those that
train the calibration transformation on a single set of scores and then apply
that transformation to the same set, and those that train the transformation on
one set of scores (dev) and apply it to another set (eval). A second partitioning
of the functions can be made according to whether the transformation is affine
or whether it uses the PAV algorithm (see section 3.2).

4.4 Fusion 

The main functions for doing fusion can again be divided (as for calibration) 
according to whether there is a set of unsupervised eval scores in addition to 
the dev scores or not. There are separate wrapper functions for doing fusion 
when quality measures are to be used. 

4.5 Other functions 

There are functions for calculating EER, minimum DCF, actual DCF, PRBEP 
and the effective prior. 

4.6 File format 

With approximately eight million trials in our development list for SRE'10,
loading and saving score files in text format became infeasible. We therefore
created a binary file format which both reduced the size of the file on disk and 
made loading and saving faster. For example, one of our tel-tel development 
files is about 60 times larger on disk in text format than in binary format and 
the binary file loads about 160 times faster than the text file. 

The binary score files contain two lists and two matrices. The lists contain 
the model and test segment names. One matrix contains the scores as real 
numbers and the other matrix is a logical matrix of the same size which indicates 
which scores correspond to valid trials. The dimensions of the matrices are the 
number of models by the number of test segments and the score at position (i, j) 
is for a trial between the ith model in the model list and the jth test segment 
in the test segment list. 
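
The layout described above can be mirrored in memory by a small container such as the following Python sketch. The class name, field names and example values are hypothetical. The sketch shows only the two lists and two matrices, not the toolkit's actual MATLAB objects or its on-disk encoding.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    models: list      # model names, one per matrix row
    segments: list    # test segment names, one per matrix column
    scores: list      # len(models) x len(segments) matrix of real numbers
    mask: list        # same shape; True where the entry is a valid trial

    def get(self, model, segment):
        """Score for the trial (model, segment), or None if it is not a trial."""
        i = self.models.index(model)
        j = self.segments.index(segment)
        return self.scores[i][j] if self.mask[i][j] else None

    def valid_trials(self):
        """All valid trials as (model, segment, score) triples."""
        return [(m, s, self.scores[i][j])
                for i, m in enumerate(self.models)
                for j, s in enumerate(self.segments)
                if self.mask[i][j]]


example = Scores(
    models=["m1", "m2"],
    segments=["s1", "s2", "s3"],
    scores=[[1.5, -0.2, 0.0], [0.3, 2.1, -1.0]],
    mask=[[True, True, False], [True, False, True]],
)
print(example.get("m2", "s3"), len(example.valid_trials()))
```

Storing the names once per row and column, rather than once per trial as a text trial list would, is what makes the dense layout compact when each model is scored against many test segments.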

The toolkit provides both MATLAB .mat and HDF5 versions of the binary
format, as well as functions for converting between binary and text formats.



Footnote: By set, we mean multiset, because the collection should retain duplicate values.
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